Section 1-5 - Final Checks

We now arrive at the last piece of the puzzle: comparing the mean against the median when filling in missing values in the training data.

Pandas - Extracting data


In [9]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/train.csv')

Pandas - Cleaning data


In [10]:
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)

As we don't know whether the mean or the median will do better, we calculate both.


In [11]:
age_mean = df['Age'].mean()
age_median = df['Age'].median()
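
The two statistics can differ noticeably when a column is skewed. As a minimal sketch, using Python's standard statistics module on a made-up list of ages (the values are hypothetical and not taken from train.csv), we can see the mean being pulled towards a few large values while the median stays put.

```python
import statistics

# Hypothetical ages, NOT taken from train.csv:
# mostly young adults plus two elderly passengers
ages = [19, 22, 24, 25, 28, 30, 71, 80]

toy_mean = statistics.mean(ages)      # pulled upwards by 71 and 80
toy_median = statistics.median(ages)  # midpoint of the two middle values, 25 and 28

print(toy_mean, toy_median)
```

Which of the two works better as a fill value depends on the data and the model, which is why we let cross-validation decide.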

In [12]:
# Fill missing ports of embarkation with the most common port
mode_embarked = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(mode_embarked)

# Encode Sex as an integer column
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)

# One-hot encode Embarked
df = pd.concat([df, pd.get_dummies(df['Embarked'], prefix='Embarked')], axis=1)

df = df.drop(['Sex', 'Embarked'], axis=1)

# Move Survived to the first column, ahead of PassengerId
cols = df.columns.tolist()
cols = [cols[1]] + cols[0:1] + cols[2:]
df = df[cols]

# Mark the remaining missing values (Age) with -1 so the Imputer can find them
df = df.fillna(-1)

train_data = df.values

Scikit-learn - Training the model


In [13]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV

imputer = Imputer(missing_values=-1)

classifier = RandomForestClassifier(n_estimators=100)

pipeline = Pipeline([
    ('imp', imputer),
    ('clf', classifier),
])
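
Conceptually, the imputer treats every -1 in a column as missing and replaces it with a statistic computed from the remaining values. The following is a pure-Python sketch of that idea, assuming a single column held in a list; the helper impute_column is hypothetical and is not scikit-learn's implementation. Note also that recent scikit-learn releases have replaced Imputer with sklearn.impute.SimpleImputer and moved GridSearchCV to sklearn.model_selection, so the imports above apply to the older version used in this notebook.

```python
import statistics

def impute_column(values, strategy='mean', missing_values=-1):
    """Replace placeholder entries with the mean or median of the observed ones.

    Hypothetical helper for illustration only.
    """
    observed = [v for v in values if v != missing_values]
    if strategy == 'mean':
        fill = statistics.mean(observed)
    elif strategy == 'median':
        fill = statistics.median(observed)
    else:
        raise ValueError('unknown strategy: %r' % strategy)
    return [fill if v == missing_values else v for v in values]

# Made-up column with two placeholders
column = [20, -1, 30, 70, -1]
filled_mean = impute_column(column, 'mean')      # placeholders become 40
filled_median = impute_column(column, 'median')  # placeholders become 30
```

Placing this step inside the pipeline means the fill value is recomputed on the training part of each cross-validation fold.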

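We now include the mean-median comparison in the parameter grid, so that the grid search evaluates both imputation strategies alongside the classifier settings.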


In [14]:
parameter_grid = {
    'imp__strategy': ['mean', 'median'],
    'clf__max_features': [0.5, 1],
    'clf__max_depth': [5, None],
}
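
Each key in the grid has two values, so the search covers 2 x 2 x 2 = 8 combinations. As a quick sketch using only the standard library, we can enumerate them ourselves; the names candidates and n_folds are introduced here purely for illustration.

```python
from itertools import product

# Same grid as in the cell above, repeated so the snippet runs on its own
parameter_grid = {
    'imp__strategy': ['mean', 'median'],
    'clf__max_features': [0.5, 1],
    'clf__max_depth': [5, None],
}

# One dictionary per combination of parameter values
names = sorted(parameter_grid)
candidates = [dict(zip(names, values))
              for values in product(*(parameter_grid[name] for name in names))]

n_folds = 5  # five-fold cross-validation
print(len(candidates), len(candidates) * n_folds)  # prints: 8 40
```

With five folds per candidate, this gives the 40 fits reported by the grid search.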

In [15]:
grid_search = GridSearchCV(pipeline, parameter_grid, cv=5, verbose=3)
grid_search.fit(train_data[0:,2:], train_data[0:,0])


Fitting 5 folds for each of 8 candidates, totalling 40 fits
[GridSearchCV] clf__max_features=0.5, clf__max_depth=5, imp__strategy=mean .....
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=5, imp__strategy=mean, score=0.821229 -   0.3s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=5, imp__strategy=mean .....
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=5, imp__strategy=mean, score=0.803371 -   0.3s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=5, imp__strategy=mean .....
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=5, imp__strategy=mean, score=0.797753 -   0.3s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=5, imp__strategy=mean .....
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=5, imp__strategy=mean, score=0.876404 -   0.2s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=5, imp__strategy=mean .....
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=5, imp__strategy=mean, score=0.837079 -   0.3s
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.3s
/Users/savarin/anaconda/envs/py27/lib/python2.7/site-packages/sklearn/preprocessing/imputation.py:307: DeprecationWarning: using a boolean instead of an integer will result in an error in the future
  median[np.ma.getmask(median_masked)] = np.nan
[GridSearchCV] clf__max_features=0.5, clf__max_depth=5, imp__strategy=median ...
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=5, imp__strategy=median, score=0.821229 -   0.3s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=5, imp__strategy=median ...
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=5, imp__strategy=median, score=0.792135 -   0.3s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=5, imp__strategy=median ...
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=5, imp__strategy=median, score=0.814607 -   0.3s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=5, imp__strategy=median ...
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=5, imp__strategy=median, score=0.848315 -   0.3s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=5, imp__strategy=median ...
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=5, imp__strategy=median, score=0.825843 -   0.3s
[GridSearchCV] clf__max_features=1, clf__max_depth=5, imp__strategy=mean .......
[GridSearchCV]  clf__max_features=1, clf__max_depth=5, imp__strategy=mean, score=0.843575 -   0.2s
[GridSearchCV] clf__max_features=1, clf__max_depth=5, imp__strategy=mean .......
[GridSearchCV]  clf__max_features=1, clf__max_depth=5, imp__strategy=mean, score=0.775281 -   0.2s
[GridSearchCV] clf__max_features=1, clf__max_depth=5, imp__strategy=mean .......
[GridSearchCV]  clf__max_features=1, clf__max_depth=5, imp__strategy=mean, score=0.747191 -   0.2s
[GridSearchCV] clf__max_features=1, clf__max_depth=5, imp__strategy=mean .......
[GridSearchCV]  clf__max_features=1, clf__max_depth=5, imp__strategy=mean, score=0.842697 -   0.2s
[GridSearchCV] clf__max_features=1, clf__max_depth=5, imp__strategy=mean .......
[GridSearchCV]  clf__max_features=1, clf__max_depth=5, imp__strategy=mean, score=0.820225 -   0.2s
[GridSearchCV] clf__max_features=1, clf__max_depth=5, imp__strategy=median .....
[GridSearchCV]  clf__max_features=1, clf__max_depth=5, imp__strategy=median, score=0.849162 -   0.2s
[GridSearchCV] clf__max_features=1, clf__max_depth=5, imp__strategy=median .....
[GridSearchCV]  clf__max_features=1, clf__max_depth=5, imp__strategy=median, score=0.769663 -   0.2s
[GridSearchCV] clf__max_features=1, clf__max_depth=5, imp__strategy=median .....
[GridSearchCV]  clf__max_features=1, clf__max_depth=5, imp__strategy=median, score=0.752809 -   0.2s
[GridSearchCV] clf__max_features=1, clf__max_depth=5, imp__strategy=median .....
[GridSearchCV]  clf__max_features=1, clf__max_depth=5, imp__strategy=median, score=0.837079 -   0.2s
[GridSearchCV] clf__max_features=1, clf__max_depth=5, imp__strategy=median .....
[GridSearchCV]  clf__max_features=1, clf__max_depth=5, imp__strategy=median, score=0.820225 -   0.3s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=None, imp__strategy=mean ..
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=None, imp__strategy=mean, score=0.837989 -   0.5s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=None, imp__strategy=mean ..
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=None, imp__strategy=mean, score=0.808989 -   0.4s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=None, imp__strategy=mean ..
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=None, imp__strategy=mean, score=0.797753 -   0.3s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=None, imp__strategy=mean ..
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=None, imp__strategy=mean, score=0.825843 -   0.3s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=None, imp__strategy=mean ..
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=None, imp__strategy=mean, score=0.842697 -   0.4s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=None, imp__strategy=median 
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=None, imp__strategy=median, score=0.832402 -   0.5s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=None, imp__strategy=median 
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=None, imp__strategy=median, score=0.797753 -   0.5s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=None, imp__strategy=median 
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=None, imp__strategy=median, score=0.786517 -   0.4s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=None, imp__strategy=median 
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=None, imp__strategy=median, score=0.792135 -   0.4s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=None, imp__strategy=median 
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=None, imp__strategy=median, score=0.859551 -   0.4s
[GridSearchCV] clf__max_features=1, clf__max_depth=None, imp__strategy=mean ....
[GridSearchCV]  clf__max_features=1, clf__max_depth=None, imp__strategy=mean, score=0.810056 -   0.3s
[GridSearchCV] clf__max_features=1, clf__max_depth=None, imp__strategy=mean ....
[GridSearchCV]  clf__max_features=1, clf__max_depth=None, imp__strategy=mean, score=0.797753 -   0.4s
[GridSearchCV] clf__max_features=1, clf__max_depth=None, imp__strategy=mean ....
[GridSearchCV]  clf__max_features=1, clf__max_depth=None, imp__strategy=mean, score=0.758427 -   0.3s
[GridSearchCV] clf__max_features=1, clf__max_depth=None, imp__strategy=mean ....
[GridSearchCV]  clf__max_features=1, clf__max_depth=None, imp__strategy=mean, score=0.814607 -   0.3s
[GridSearchCV] clf__max_features=1, clf__max_depth=None, imp__strategy=mean ....
[GridSearchCV]  clf__max_features=1, clf__max_depth=None, imp__strategy=mean, score=0.831461 -   0.3s
[Parallel(n_jobs=1)]: Done  32 jobs       | elapsed:    9.7s
[GridSearchCV] clf__max_features=1, clf__max_depth=None, imp__strategy=median ..
[GridSearchCV]  clf__max_features=1, clf__max_depth=None, imp__strategy=median, score=0.821229 -   0.3s
[GridSearchCV] clf__max_features=1, clf__max_depth=None, imp__strategy=median ..
[GridSearchCV]  clf__max_features=1, clf__max_depth=None, imp__strategy=median, score=0.786517 -   0.3s
[GridSearchCV] clf__max_features=1, clf__max_depth=None, imp__strategy=median ..
[GridSearchCV]  clf__max_features=1, clf__max_depth=None, imp__strategy=median, score=0.758427 -   0.3s
[GridSearchCV] clf__max_features=1, clf__max_depth=None, imp__strategy=median ..
[GridSearchCV]  clf__max_features=1, clf__max_depth=None, imp__strategy=median, score=0.803371 -   0.3s
[GridSearchCV] clf__max_features=1, clf__max_depth=None, imp__strategy=median ..
[GridSearchCV]  clf__max_features=1, clf__max_depth=None, imp__strategy=median, score=0.859551 -   0.3s
[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:   12.0s finished
Out[15]:
GridSearchCV(cv=5,
       estimator=Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values=-1, strategy='mean', verbose=0)), ('clf', RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0))]),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'clf__max_features': [0.5, 1], 'clf__max_depth': [5, None], 'imp__strategy': ['mean', 'median']},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=3)

In [16]:
sorted(grid_search.grid_scores_, key=lambda x: x.mean_validation_score)
grid_search.best_score_
grid_search.best_params_


Out[16]:
{'clf__max_depth': 5, 'clf__max_features': 0.5, 'imp__strategy': 'mean'}
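
To put the winning choice in perspective, we can average the five fold scores that the log above reports for max_features=0.5 and max_depth=5 under each strategy. This is a rough check using only the standard library; the plain average may differ slightly from the score GridSearchCV itself reports, since with iid=True it weights folds by their size.

```python
import statistics

# Fold scores copied from the grid search log for max_features=0.5, max_depth=5
scores_mean = [0.821229, 0.803371, 0.797753, 0.876404, 0.837079]
scores_median = [0.821229, 0.792135, 0.814607, 0.848315, 0.825843]

avg_mean = statistics.mean(scores_mean)
avg_median = statistics.mean(scores_median)

print('mean strategy:   %.4f' % avg_mean)
print('median strategy: %.4f' % avg_median)
print('difference:      %.4f' % (avg_mean - avg_median))
```

The gap of under one percentage point is small next to the spread between folds (roughly 0.80 to 0.88 for the mean strategy), and the forest is built without a fixed random_state, so a rerun may well rank the two strategies differently.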

As before, we replace the -1 placeholders in the Age column with the better performer, which in this run is the mean.


In [17]:
df['Age'] = df['Age'].map(lambda x: age_mean if x == -1 else x)

In [18]:
train_data = df.values

In [19]:
model = RandomForestClassifier(n_estimators=100, max_features=0.5, max_depth=5)
model = model.fit(train_data[0:, 2:], train_data[0:, 0])

Scikit-learn - Making predictions


In [20]:
df_test = pd.read_csv('../data/test.csv')

df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)

Similarly, we fill in the missing Age values in the test data with the better performer, using the mean computed on the training data.


In [21]:
df_test['Age'] = df_test['Age'].fillna(age_mean)

In [22]:
# Mean fare per passenger class, computed on the training data
fare_means = df.groupby('Pclass')['Fare'].mean()
# Fill each missing test fare with the mean fare of that passenger's class
df_test['Fare'] = df_test['Fare'].fillna(df_test['Pclass'].map(fare_means))

# Apply the same encoding as for the training data
df_test['Gender'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)
df_test = pd.concat([df_test, pd.get_dummies(df_test['Embarked'], prefix='Embarked')],
                axis=1)

df_test = df_test.drop(['Sex', 'Embarked'], axis=1)

test_data = df_test.values

# Column 0 is PassengerId, so the features start at column 1
output = model.predict(test_data[:,1:])

Pandas - Preparing for submission


In [23]:
result = np.c_[test_data[:,0].astype(int), output.astype(int)]


df_result = pd.DataFrame(result[:,0:2], columns=['PassengerId', 'Survived'])
df_result.to_csv('../results/titanic_1-5.csv', index=False)
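
As a last sanity check before uploading, it can help to confirm that the file has the two expected columns and contains only 0/1 predictions. Below is a minimal sketch using the standard csv module; check_submission is a hypothetical helper, and the in-memory sample stands in for the real file so that the snippet runs on its own.

```python
import csv
import io

def check_submission(handle):
    """Return the number of data rows after validating header and values.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(handle)
    header = next(reader)
    if header != ['PassengerId', 'Survived']:
        raise ValueError('unexpected header: %r' % header)
    n_rows = 0
    for passenger_id, survived in reader:
        int(passenger_id)  # raises ValueError if the id is not an integer
        if survived not in ('0', '1'):
            raise ValueError('unexpected prediction: %r' % survived)
        n_rows += 1
    return n_rows

# Made-up sample in the same layout as the submission file
sample = io.StringIO('PassengerId,Survived\n892,0\n893,1\n894,0\n')
n_checked = check_submission(sample)
print(n_checked)
```

For the real file, pass an open handle instead, for example open('../results/titanic_1-5.csv'), and compare the returned count with the number of rows in test.csv.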